With the support of the ONR, my research focused on extending the state of the art in
probabilistic topic modeling: algorithms for making discoveries from, and predictions about,
large collections of texts. Over the past three years, my group has published many papers in
the service of this goal. In this report, I highlight some of the themes and publications
that represent this work. Thanks to the support of the ONR, we have made excellent progress
toward our stated goals.


Bayesian  nonparametric  modeling 

Topic  models  discover  the  latent  themes  that  pervade  a  corpus  of  documents.  Our  broad 
goal  is  to  build  methods  that  can  be  applied  widely,  that  is,  to  many  kinds  of  corpora.  This 
motivates  our  development  of  Bayesian  nonparametric  (BNP)  models.  BNP  models  adapt 
the structure of the discovered topics, such as the number of latent topics or the form
of  the  topic  hierarchy,  to  the  corpus  at  hand.  (In  contrast,  traditional  parametric  models 
require  that  the  analyst  specify  this  structure  in  advance.)  We  have  developed  new  Bayesian 
nonparametric  topic  models  and  new  scalable  algorithms  for  BNP  modeling. 
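
To illustrate how a BNP prior lets the number of components grow with the data, the following is a minimal sketch of sampling a partition from a Chinese restaurant process, a standard prior in this family. The function name, the concentration parameter `alpha`, and the seed are illustrative choices and are not taken from any of the software described in this report.

```python
import random


def sample_crp(num_customers, alpha, seed=0):
    """Sample table assignments from a Chinese restaurant process (sketch)."""
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = customers seated at table k
    assignments = []   # assignments[i] = table chosen by customer i
    for _ in range(num_customers):
        # Join existing table k with weight table_sizes[k],
        # or open a new table with weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[k] += 1
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    # The number of occupied tables is not fixed in advance; it grows with n.
    for n in (10, 100, 1000):
        _, sizes = sample_crp(n, alpha=1.0)
        print(n, "customers ->", len(sizes), "tables")
```

In a mixture model built on this prior, each table plays the role of a component, so the analyst sets `alpha` rather than the number of components.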


•  Sam  Gershman  and  I  wrote  a  tutorial  about  Bayesian  nonparametrics  [12].  There  is  a 
high  bar  to  working  with  BNP  models,  as  the  literature  has  evolved  from  several  fields. 
We  hope  that  our  tutorial  will  provide  a  clear  introduction  to  the  main  ideas. 

•  We  have  explored  several  ways  of  building  dependence  into  BNP  models. 

Peter  Frazier  (Cornell)  and  I  developed  distance  dependent  Bayesian  nonparametric 
models  [1,  2,  13].  These  allow  external  data  sources  to  influence  the  latent  clustering 
(and  latent  feature  representation)  of  a  variety  of  data.  We  used  these  models  to 
capture  sequential  dependence  in  text  and  spatial  dependence  in  images.  We  released 
open-source  software  that  implements  our  algorithms. 

In other work, Lauren Hannah (Duke), Warren Powell, and I developed Dirichlet
process mixtures of generalized linear models (DP-GLMs) [14, 15]. These allow covariates to affect
the clustering of the data and, within each cluster, to relate to the response through a
generalized linear model. DP-GLMs allow us to fit
appropriately complex response functions in prediction problems, capturing nonlinearity
via several linear components.

•  BNP  topic  models  are  hierarchical  mixed-membership  models  of  text,  usually  based 
on  the  hierarchical  Dirichlet  process.  One  thread  of  our  research  has  been  to  build 
hierarchical  BNP  models  that  relax  some  of  the  limiting  assumptions  of  the  original 
HDP. 

John  Paisley  (Berkeley),  Chong  Wang,  and  I  developed  the  Discrete  Infinite  Logistic 
Normal  (DILN),  which  is  a  new  kind  of  Bayesian  nonparametric  model  [19].  (This  paper 
won a Notable Paper Award at AISTATS.) DILN allows the atoms of an underlying
random  measure  to  exhibit  correlation.  The  DILN  topic  model  is  a  BNP  variant  of  the 
correlated  topic  model,  allowing  the  appearance  within  a  document  of  latent  subjects 
(like  health  and  sports)  to  be  correlated.  Unlike  the  correlated  topic  model,  the  number 
of  topics  is  determined  by  the  data. 

In  other  work,  Chong  Wang  and  I  developed  a  hierarchical  BNP  topic  model  with  “spike 
and  slab”  priors  on  the  latent  topics  [20].  This  gave  better  predictive  performance, 
decoupling  the  sparsity  of  the  topics  and  their  smoothness,  i.e.,  decoupling  how  many 
words  a  topic  contains  from  how  confident  we  are  about  their  probabilities  within  it. 

•  Michael  Jordan  (Berkeley),  Tom  Griffiths  (Berkeley),  and  I  developed  hierarchical 
latent  Dirichlet  allocation  [3].  This  is  a  BNP  topic  model  that  finds  an  arbitrary  tree 
structure  (of  arbitrary  depth)  to  describe  the  topics  in  a  collection  of  documents.  The 
prior  distribution  we  developed  for  this  model — the  nested  Chinese  restaurant  process — 
illustrates  the  advantages  of  BNP  methods.  While  classical  methods  of  model  selection 
can  be  used  to  choose  a  simple  number  of  components,  these  methods  cannot  help  us 
search  over  the  arbitrary  space  of  tree  structures. 

In  more  recent  work,  Chong  Wang  and  I  developed  a  fast  variational  inference  algorithm 
for this model [24]. This is the first variational inference method that searches over
combinatorial  structures  as  part  of  the  optimization. 

•  The  research  items  above  focus  on  mixture  models  or  mixed-membership  models.  We 
have  also  worked  on  latent  factor  models,  i.e.,  Bayesian  nonparametric  models  of  matrix 
factorization. Sinead Williamson (Carnegie Mellon), Katherine Heller (Duke), Chong
Wang,  and  I  used  BNP  factor  models  with  mixed-membership  models  to  better  control 
sparsity  in  determining  how  many  topics  are  active  in  each  document  [25]. 

In  other  work,  John  Paisley  (Berkeley),  Larry  Carin  (Duke),  and  I  developed  a  fast 
variational  inference  algorithm  for  BNP  factor  models  [18]. 
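
The distance dependent models in the second item above can be illustrated with a minimal sketch of a distance dependent Chinese restaurant process prior. Each data point links to another point with a weight given by a decay function of their distance, or to itself with weight `alpha`; clusters are the connected components of the resulting links. The exponential decay, the one-dimensional positions, and all names here are assumptions chosen for illustration, and this is not the released software.

```python
import math
import random


def sample_ddcrp(positions, alpha, decay_scale, seed=0):
    """Sample links and cluster labels from a distance dependent CRP (sketch)."""
    rng = random.Random(seed)
    n = len(positions)

    # Each point i picks a link: itself with weight alpha, or point j
    # with weight exp(-distance / decay_scale) (an assumed decay function).
    links = []
    for i in range(n):
        weights = [
            alpha if j == i
            else math.exp(-abs(positions[i] - positions[j]) / decay_scale)
            for j in range(n)
        ]
        links.append(rng.choices(range(n), weights=weights)[0])

    # Clusters are connected components of the link graph (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in enumerate(links):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j

    label_of_root = {}
    labels = [label_of_root.setdefault(find(i), len(label_of_root))
              for i in range(n)]
    return links, labels


if __name__ == "__main__":
    points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
    print(sample_ddcrp(points, alpha=1.0, decay_scale=0.5))
```

With a short `decay_scale`, nearby points are far more likely to link to each other than to distant points, which is how external information such as time or spatial position can shape the clustering.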


Dynamic  topic  models 

We  continue  to  research  models  of  language  that  capture  how  language  changes  over  time. 

• Our earlier work on this subject assumed that time was discrete; for example, we
analyzed Science year by year. Chong Wang and I developed a continuous time dynamic
topic model [22]. This lets documents appear with time-stamps at arbitrary granularity.
Further, we can model language change at multiple resolutions.

• Sean Gerrish and I developed a dynamic topic model that captures influential documents
[10]. Our model posits that an influential document is one that is prescient of
how  language  changed.  For  example,  Einstein’s  first  paper  about  General  Relativity 
was  an  influential  paper  because  many  papers  discussed  it  subsequent  to  its  publication. 
Our  method  infers  the  influence  score  of  each  document  by  analyzing  a  large  corpus  of 
sequentially  ordered  documents. 

To validate our method, we inferred influence scores on several large corpora of scientific
articles and found that our score correlates significantly with citation counts.
I emphasize that our scores are computed only from the language of the articles
themselves, so our model could be used to find influential documents in corpora that do
not contain citations. The Economist reported on this research (“Organising the Web:
The Science of Science,” April 28, 2011).


Modeling  networks  and  text 

We  have  also  developed  new  models  of  networks  and  their  relationships  to  text. 


• Jonathan Chang (Facebook) and I developed the relational topic model, which finds
topics that respect the network connectivity of the documents [6, 7]. Unlike traditional
network models, this model incorporates node content; it can predict content from
links and links from content. Jonathan released open-source software that implements
his algorithm.

•  Jonathan  Chang  (Facebook),  Jordan  Boyd-Graber  (University  of  Maryland)  and  I 
developed  a  model  that  discovers  the  social  network  hidden  inside  texts  [8].  The  idea  is 
to  use  named  entities  in  the  text  and  to  identify  when  two  named  entities  significantly 
co-occur.  Further,  we  find  patterns  of  words  that  describe  these  relationships.  For 
example,  we  analyzed  a  corpus  of  New  York  Times  articles  to  find  related  people  in  the 
news  and  to  describe  their  relationships.  The  model  discovered  relationships  described 
by familial words, adversarial words, and others.
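
As a rough illustration of how a model can tie links to content, as the relational topic model in the first item above does, the sketch below scores a pair of documents by comparing their topic proportions element by element and passing the weighted sum through a logistic function. This is one possible link function of this kind; the names `eta` and `nu` and the example proportions are hypothetical values set by hand, whereas a fitted model would estimate such quantities from data.

```python
import math


def link_probability(topics_a, topics_b, eta, nu):
    """Logistic link score from the element-wise product of topic proportions."""
    score = nu + sum(e * a * b for e, a, b in zip(eta, topics_a, topics_b))
    return 1.0 / (1.0 + math.exp(-score))


if __name__ == "__main__":
    eta = [4.0, 4.0, 4.0]   # hypothetical per-topic weights
    nu = -2.0               # hypothetical intercept
    sports_1 = [0.9, 0.05, 0.05]
    sports_2 = [0.8, 0.1, 0.1]
    health = [0.05, 0.9, 0.05]
    # Documents that share topics receive a higher link probability.
    print(link_probability(sports_1, sports_2, eta, nu))
    print(link_probability(sports_1, health, eta, nu))
```

Because the score depends only on topic proportions, the same function can be read in both directions: given text, it ranks likely links, and given links, it constrains which topics the linked documents should share.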

Models  of  text  and  other  variables 

Much  of  our  recent  work  centers  around  using  other  kinds  of  variables  to  help  anchor  text 
models,  and  to  use  text  models  to  predict  other  kinds  of  variables. 

• Chong Wang and I developed a new method for collaborative filtering. Our method
uses  both  user  preferences  and  content  about  the  items  [21].  This  work  won  the  Best 
Student  Paper  Award  at  KDD  2011. 

•  Sean  Gerrish  and  I  built  a  model  of  legislative  roll  call  data  (i.e.,  votes  on  bills)  and 
bill texts [11]. This extends classical quantitative political science models, which only
model  votes.  This  work  won  a  Distinguished  Application  Award  at  ICML  2011.  We 
are  continuing  to  work  on  this  area,  building  a  new  exploratory  model  of  legislators 
that  gives  descriptions  of  how  their  votes  deviate  from  otherwise  typical  patterns. 

•  Jordan  Boyd-Graber  (University  of  Maryland)  and  I  have  developed  several  methods 
for  combining  natural  language  processing  data  with  topic  models.  In  one  project, 
we  modeled  multi-lingual  corpora  [4].  In  another,  we  modeled  constraints  based  on 
dependency  parses  and  latent  topics  [5]. 
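
To make the collaborative filtering idea in the first item above concrete, here is a minimal sketch of how a predicted rating can combine content and preferences: an item's latent vector is its topic proportions plus an offset learned from ratings, and the prediction is its inner product with a user vector. The vectors below are made-up values and the function names are hypothetical; this illustrates the general form only, not the fitting algorithm of [21].

```python
def item_vector(topic_proportions, offset=None):
    """Combine content (topic proportions) with a ratings-driven offset."""
    if offset is None:
        # A new item with no ratings yet is represented by its content alone.
        offset = [0.0] * len(topic_proportions)
    return [t + o for t, o in zip(topic_proportions, offset)]


def predict_rating(user_vector, item_vec):
    """Predicted preference as the inner product of user and item vectors."""
    return sum(u * v for u, v in zip(user_vector, item_vec))


if __name__ == "__main__":
    user = [1.0, 0.0]          # hypothetical user who likes topic 0
    topics = [0.25, 0.75]      # hypothetical topic proportions of an item
    print(predict_rating(user, item_vector(topics)))
    print(predict_rating(user, item_vector(topics, offset=[0.5, 0.0])))
```

The design choice this highlights is that content gives a usable prediction for items nobody has rated, while the offset lets observed preferences pull an item away from what its text alone suggests.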


Current  efforts 

We have recently turned our attention to two important problems.


• First, we are examining scalable computation for topic models. Matt Hoffman (Columbia),
Francis Bach (INRIA), and I developed stochastic variational inference for latent
Dirichlet allocation [16]. This algorithm lets us analyze massive document collections,
including  document  collections  that  arrive  in  a  never-ending  stream.  Chong  Wang 
and  I  extended  this  algorithm  to  the  hierarchical  Dirichlet  process,  enabling  us  to  fit 
Bayesian  nonparametric  models  to  massive  data  [23]. 

• Second, we are examining how we can better use topic models for interpretative and
exploratory tasks, and how we might make this problem mathematically
rigorous and well-defined. Jonathan Chang (Facebook), Jordan Boyd-Graber (University
of Maryland), Chong Wang, Sean Gerrish, and I implemented a large-scale
user  study  with  Amazon’s  Mechanical  Turk  to  assess  how  interpretable  topic  models 
can  be  [9].  This  was  the  first  evaluation  of  unsupervised  learning  for  interpretation 
with  Mechanical  Turk.  (Since  this  paper,  others  have  reproduced  and  emulated  our 
experimental  set-up.) 

In  more  recent  work,  David  Mimno  and  I  have  explored  posterior  predictive  checks  for 
topic  models  [17].  This  promises  to  be  an  automated  way  to  assess  which  topics  are 
interpretable, without needing to run a user study for each fitted model.
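
The stochastic variational inference approach in the first item above rests on a simple update: form a noisy estimate of the global topic parameters from a small batch of documents, then blend it into the current estimate with a decreasing step size. The sketch below shows that update in isolation. The schedule parameters `tau0` and `kappa`, the function names, and the toy numbers are assumptions for illustration; the per-document inference that would produce `expected_counts` is omitted.

```python
def step_size(t, tau0=1.0, kappa=0.7):
    """Decreasing step size for iteration t (kappa in (0.5, 1] is typical)."""
    return (t + tau0) ** (-kappa)


def minibatch_estimate(expected_counts, eta, corpus_size, batch_size):
    """Topic parameters as if the whole corpus resembled this batch.

    expected_counts[k][w] is the expected number of times word w was
    assigned to topic k in the batch; eta is the prior on topics.
    """
    scale = corpus_size / batch_size
    return [[eta + scale * c for c in row] for row in expected_counts]


def blend(current, estimate, rho):
    """Move the current parameters toward the noisy estimate by weight rho."""
    return [
        [(1.0 - rho) * a + rho * b for a, b in zip(row_cur, row_est)]
        for row_cur, row_est in zip(current, estimate)
    ]


if __name__ == "__main__":
    topics = [[1.0, 1.0], [1.0, 1.0]]       # 2 topics, 2-word vocabulary
    counts = [[1.0, 0.0], [0.0, 2.0]]       # toy batch statistics
    for t in range(3):
        estimate = minibatch_estimate(counts, eta=0.5,
                                      corpus_size=100, batch_size=10)
        topics = blend(topics, estimate, step_size(t))
    print(topics)
```

Since each update touches only one batch, the same loop applies whether the documents come from a fixed collection or arrive as a stream.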